Do not perform another pass of Query Automaton Minimization #8237
Codecov Report
@@ Coverage Diff @@
## master #8237 +/- ##
============================================
- Coverage 70.96% 64.00% -6.96%
+ Complexity 4320 4239 -81
============================================
Files 1626 1584 -42
Lines 85081 83250 -1831
Branches 12803 12608 -195
============================================
- Hits 60377 53288 -7089
- Misses 20545 26127 +5582
+ Partials 4159 3835 -324
I observe a good speed up with this change.
I got these numbers with JDK 11 (Corretto) on my MacBook Pro with the CLI args below:
So there are no integer-multiple differences for unanchored prefixes in this run. Anchored prefixes are much faster than with Lucene, though the native implementation also appears to warm up faster. I can run this on some more stable machines, but we wouldn't see this kind of improvement by accident.
The native text engine minimises the query automaton after construction using Hopcroft's algorithm. This can get expensive for large query automata and does not yield much improvement anyway, since the query automaton is built once and used once.
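To illustrate why skipping minimization is safe, here is a minimal sketch on a toy DFA. It is not Pinot's native FST code: the class `MinimizationSketch` and all of its methods are hypothetical, and it uses Moore-style partition refinement rather than Hopcroft's algorithm because it is shorter to write. The point it shows is that minimization only merges equivalent states; the set of accepted inputs stays the same, so an unminimized query automaton gives the same matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy DFA, not Pinot's native FST or its automaton classes.
class MinimizationSketch {

    // Moore-style partition refinement (an assumption; the PR mentions Hopcroft's algorithm).
    // next[s][c] is the state reached from state s on symbol c.
    // Returns one equivalence-class id per state; state 0 always gets class 0.
    static int[] equivalenceClasses(int[][] next, boolean[] accepting) {
        int n = next.length;
        int[] cls = new int[n];
        for (int s = 0; s < n; s++) {
            cls[s] = accepting[s] ? 1 : 0;
        }
        int previousCount = -1;
        while (true) {
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int s = 0; s < n; s++) {
                // A state's signature is its own class plus the classes of its targets.
                List<Integer> signature = new ArrayList<>();
                signature.add(cls[s]);
                for (int target : next[s]) {
                    signature.add(cls[target]);
                }
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                refined[s] = id;
            }
            // Refinement only splits classes, so an unchanged count means a fixed point.
            if (ids.size() == previousCount) {
                return refined;
            }
            previousCount = ids.size();
            cls = refined;
        }
    }

    static int countClasses(int[] classes) {
        int max = -1;
        for (int c : classes) {
            max = Math.max(max, c);
        }
        return max + 1;
    }

    // Builds the transition table of the merged (quotient) DFA.
    static int[][] quotientTransitions(int[][] next, int[] classes) {
        int[][] merged = new int[countClasses(classes)][];
        for (int s = 0; s < next.length; s++) {
            int[] row = new int[next[s].length];
            for (int c = 0; c < row.length; c++) {
                row[c] = classes[next[s][c]];
            }
            merged[classes[s]] = row;
        }
        return merged;
    }

    static boolean[] quotientAccepting(boolean[] accepting, int[] classes) {
        boolean[] merged = new boolean[countClasses(classes)];
        for (int s = 0; s < accepting.length; s++) {
            merged[classes[s]] = accepting[s];
        }
        return merged;
    }

    // Runs a DFA from state 0 over the input symbols.
    static boolean accepts(int[][] next, boolean[] accepting, int[] input) {
        int state = 0;
        for (int symbol : input) {
            state = next[state][symbol];
        }
        return accepting[state];
    }

    public static void main(String[] args) {
        // Toy DFA over symbols {0, 1} accepting inputs that contain symbol 0.
        // States 1 and 2 are redundant copies of each other.
        int[][] next = {{1, 0}, {2, 1}, {1, 2}};
        boolean[] accepting = {false, true, true};
        int[] classes = equivalenceClasses(next, accepting);
        int[][] minNext = quotientTransitions(next, classes);
        boolean[] minAccepting = quotientAccepting(accepting, classes);
        System.out.println("states before: " + next.length + ", after: " + minNext.length);
        int[] input = {1, 0, 1};
        System.out.println("same answer: "
            + (accepts(next, accepting, input) == accepts(minNext, minAccepting, input)));
    }
}
```

For this toy DFA the three states collapse to two, and both automata agree on every input. Minimization therefore buys a smaller automaton, not a different result, which matters little when the automaton is discarded after a single query.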
Post this change, performance numbers using BenchmarkNativeAndLuceneBasedLike:
| Benchmark | (_fstType) | (_intBaseValue) | (_numBlocks) | (_numRows) | (_query) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 0 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 40.436 | ± 8.662 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 0 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 50.320 | ± 4.254 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 1 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 42.378 | ± 2.669 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 1 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 53.890 | ± 2.951 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 10 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 47.751 | ± 1.149 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 10 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 60.890 | ± 1.949 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 100 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 93.937 | ± 8.493 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | LUCENE | 1000 | 100 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 129.687 | ± 16.903 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 0 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 55.362 | ± 10.320 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 0 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 16.610 | ± 1.297 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 1 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 54.800 | ± 1.501 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 1 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 18.417 | ± 0.696 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 10 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 60.187 | ± 3.858 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 10 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 25.549 | ± 1.694 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 100 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'` | avgt | 5 | 106.765 | ± 13.996 | us/op |
| BenchmarkNativeAndLuceneBasedLike.query | NATIVE | 1000 | 100 | 2500000 | `SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'` | avgt | 5 | 99.888 | ± 1.029 | us/op |
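Note that for the generic match query '%domain%', Lucene and Native FST are close at every block count from 0 to 100, with no integer-multiple gap (in the run above, Native FST is roughly 14% to 37% slower). For the prefix query 'www.domain%', Native FST is about 3x faster at 0 and 1 blocks, about 2.4x faster at 10 blocks, and about 1.3x faster at 100 blocks.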
This behaviour was observed over multiple runs of the benchmark. Detailed results at:
https://docs.google.com/document/d/1Jd-Oe0F9gx9WAB1sa5YdW7KZ_EsPOJdcHK1bsON9JHM/edit?usp=sharing
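The Lucene-to-Native ratios for the prefix query can be recomputed from the mean scores in the table above. The sketch below does that arithmetic; the class name `SpeedupSketch` and its method are hypothetical, the scores are copied from this one run, and the error margins are ignored, so treat the ratios as approximate.

```java
// Illustrative arithmetic only: recomputes ratios from the mean scores listed above.
class SpeedupSketch {

    // Speedup of Native over Lucene: how many times longer Lucene takes per operation.
    static double speedup(double luceneMicros, double nativeMicros) {
        return luceneMicros / nativeMicros;
    }

    public static void main(String[] args) {
        int[] numBlocks = {0, 1, 10, 100};
        // Mean us/op for LIKE 'www.domain%' taken from the benchmark table (error margins ignored).
        double[] lucenePrefix = {50.320, 53.890, 60.890, 129.687};
        double[] nativePrefix = {16.610, 18.417, 25.549, 99.888};
        for (int i = 0; i < numBlocks.length; i++) {
            System.out.printf("numBlocks=%d: Native is %.2fx faster than Lucene%n",
                numBlocks[i], speedup(lucenePrefix[i], nativePrefix[i]));
        }
    }
}
```

The same helper can be applied to the '%domain%' rows, where the ratio falls below 1 because Lucene has the lower mean score in this run.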